Abstract

This paper considers the topic of finding prior distributions when a major component of the statistical model 
depends on a nonlinear function. Using results on how to construct uniform distributions in general metric spaces, 
we propose a prior distribution that is uniform in the space of functional shapes of the underlying nonlinear 
function and then back-transform to obtain a prior distribution for the original model parameters. The primary 
application considered in this article is nonlinear regression, but the idea might be of interest beyond this case. For 
nonlinear regression the priors constructed in this way have the advantage that they are parametrization invariant and do
not violate the likelihood principle, as opposed to uniform distributions on the parameters or the Jeffreys prior,
respectively. The utility of the proposed priors is demonstrated in the context of nonlinear regression modelling
in clinical dose-finding trials, through a real data example and simulation. In addition the proposed priors are 
used for calculation of an optimal Bayesian design. 



1 Introduction 

Mathematical models of the real world are typically nonlinear; examples in medical or biological applications can
be found for instance in Lindsey (2001) or Jones et al. (2010). Setting up prior distributions in a statistical
analysis of nonlinear models, however, often remains a challenge. If external, numerical or non-numerical information
exists, one can try to quantify it into a probability distribution, see for example the works of O'Hagan et al. (2006),
Bornkamp and Ickstadt (2009), and Neuenschwander et al. (2010). The classical approach in the absence of substantive
information is Jeffreys prior distribution (or variants), given by $p(\theta) \propto \sqrt{\det(I(\theta))}$, where
$\theta \in \Theta$ is the parameter, and $I(\theta)$ the Fisher information matrix of the underlying statistical model.
See Kass and Wasserman (1996), Ghosh et al. (2006, ch. 5) or Berger et al. (2009) for this approach and generalizations.
A serious drawback is the fact that this prior can depend on observed covariates. In the case of nonlinear regression
analysis, the prior depends on the design points and relative allocations to these points and thus violates the
likelihood principle. Apart from the foundational issues this raises (see, e.g., O'Hagan and Forster (2004, ch. 3)), it
also has undesirable practical consequences. For Bayesian optimal design calculations in nonlinear regression models,
for example, Jeffreys prior cannot be used, because it depends on the design points, which is what we want to calculate
in the optimal design problem. In the context of adaptive dose-finding clinical trials, patients are allocated
dynamically to the available doses (see the works of Müller et al. (2006) or Dragalin et al. (2010)), so that the
sequential analysis of the data will differ from the analysis combining all data when using Jeffreys rule. In summary,
the main issue with the Jeffreys prior distribution is that one cannot state it before data collection, which is
crucial in some applications. Surprisingly few proposals have been made to overcome this situation. In current practice,
uniform distributions for $\theta$ on a reasonable compact subset of the parameter space are often used. This approach
is, however, extremely sensitive to the chosen parametrization (which might be more or less arbitrary) and can be much
more informative than one would expect intuitively.

Figure 1: (i) Display of the uniform distribution on the $\theta$ scale; (ii) Display of the regression function
$\exp(-\theta x)$ for $\theta = 0$, $\theta = 5$ and the $\theta$ corresponding to the $i/10$ quantile,
$i = 1, \ldots, 9$, of the uniform distribution.

To illustrate the point, we will use a simple example. Suppose one would like to analyse data using the exponential
model $\exp(-\theta x)$, here with $x \in [0, 10]$, which could be the mean function in a regression analysis. Assume
that no historical data or practical experiences related to the problem are available.

A first pragmatic approach in this situation is to use a uniform distribution on $\theta$ values leading to a reasonable
shape coverage of the underlying regression function $\exp(-\theta x)$; for example, the interval $\theta \in [0, 5]$
covers the underlying shapes almost entirely. The consequences of assuming a uniform prior on $[0, 5]$ can be observed
in Figure 1 (ii). While the prior is uniform in $\theta$ space, it places most of its prior probability mass on the
functional shapes that decrease quickly towards zero, and we end up with a very informative prior distribution in the
space of functional shapes. This is highly undesirable when limited prior information regarding the shape is available.
In addition it depends crucially on the upper bound selected for $\theta$, and a uniform distribution in an alternative
parameterization would lead to an entirely different prior in the space of shapes. One way to overcome these problems
is to use a distribution that is uniform in the space of functional shapes of the underlying nonlinear function. This
will be uninformative from the functional viewpoint and will not depend on the selected parameterization.
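The concentration of prior mass on quickly decreasing shapes can be checked with a few lines of code. The following is a minimal sketch in Python, not part of the original analysis; the helper names `uniform_quantiles` and `shape_at` and the choice of the evaluation point $x = 1$ are illustrative assumptions. It evaluates $\exp(-\theta x)$ at $x = 1$ for the $i/10$ quantiles of the uniform prior on $[0, 5]$.

```python
import math

def uniform_quantiles(lower, upper, n=9):
    """Return the i/(n+1) quantiles, i = 1..n, of a uniform distribution on [lower, upper]."""
    return [lower + (upper - lower) * i / (n + 1) for i in range(1, n + 1)]

def shape_at(theta, x):
    """Value of the regression function exp(-theta * x)."""
    return math.exp(-theta * x)

# Quantiles of the uniform prior on [0, 5] (illustrative bounds taken from the text).
thetas = uniform_quantiles(0.0, 5.0)

# Evaluate each quantile curve at x = 1, i.e. after 10% of the design range [0, 10].
values_at_1 = [shape_at(theta, 1.0) for theta in thetas]
share_below_half = sum(v < 0.5 for v in values_at_1) / len(values_at_1)

for theta, value in zip(thetas, values_at_1):
    print(f"theta = {theta:.1f}: exp(-theta * 1) = {value:.3f}")
print(f"share of quantile curves below 0.5 at x = 1: {share_below_half:.2f}")
```

In this sketch eight of the nine quantile curves have already dropped below one half at $x = 1$, which is the informativeness in shape space described above.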

In finite dimensional situations it is a standard approach to use distributions that are uniform in an interpretable
parameter transformation, when it is difficult to use the classical default prior distributions. In the context of
Dirichlet process mixture modelling, one can use a uniform distribution on the probability that two observations cluster
into one group and then transfer this into a prior distribution for the precision parameter of the Dirichlet process.
In the challenging problem of assigning a prior distribution for variance parameters in hierarchical models,
Daniels (1999) assumes a uniform distribution on the shrinkage coefficient and then transfers this to a prior
distribution for the variance parameter. In these cases the standard change of variables theorem can be used to derive
the necessary uniform distributions. When we want to impose a uniform distribution in the space of functional shapes of
an underlying regression function, however, it is not entirely obvious how to construct a uniform distribution. In the
next section we will review a methodology that allows the construction of uniform distributions on general metric
spaces. In Section 2.2 we will adapt this to the nonlinear models that we consider in this article. Finally, in
Section 3 we test the priors for nonlinear regression on a data set from a dose-finding trial, a simulation study and
an optimal design problem.



2 Methodology 



2.1 General Approach 

Suppose one would like to find a prior distribution for a parameter $\theta$ in a compact subspace
$\Theta \subset \mathbb{R}^p$, $p < \infty$. The approach proposed in this paper is to map the parameter $\theta$ from
$\Theta$ into another compact metric space $(M, d)$, with metric $d$, using a differentiable bijective function
$\varphi: \Theta \to M$, so that $\varphi(\theta) = \phi \in M$. The metric $d$ should ideally define a reasonable
measure of closeness and distance between the parameters, and its choice will of course be model and application
dependent. In the exponential regression example, for instance, it seems adequate to measure the distance between two
parameter values $\theta'$ and $\theta''$ by a distance between the resulting functions $\exp(-x\theta')$ and
$\exp(-x\theta'')$, rather than the Euclidean distance between the plain parameter values. In this metric space
$(M, d)$, one then imposes a uniform distribution, reflecting the appropriate notion of distance of the metric space
$(M, d)$, and transforms this distribution back to the parameter scale.



The construction of a uniform distribution in general metric spaces has been described by Dembski (1990), using the
notion of packing numbers. Ghosal et al. (1997) apply this result for two particular Bayesian applications (derivation
of Jeffreys prior for parametric problems and nonparametric density estimation). In the following we review and
adapt this theory to our situation. Some basic mathematical notions are needed to present the ideas: Define an
$\epsilon$-net as a set $S_\epsilon \subset M$, so that for all $\phi', \phi'' \in S_\epsilon$ it holds that
$d(\phi', \phi'') \geq \epsilon$, and the addition of any point to $S_\epsilon$ destroys this property. An
$\epsilon$-lattice $S_\epsilon^m$ is the $\epsilon$-net with maximum possible cardinality. Dembski defines the uniform
distribution on $M$ as the limit of a discrete uniform distribution on an $\epsilon$-lattice on $M$, when
$\epsilon \to 0$.

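Definition 1 The uniform distribution $\Pi$ on $M$ is defined as

$$\Pi(A) = \lim_{\epsilon \to 0} \Pi_\epsilon(A),$$

for $A \subset M$, where $\Pi_\epsilon$ is the discrete uniform distribution supported on the points in
$S_\epsilon^m$, i.e. $\Pi_\epsilon(A) = |A \cap S_\epsilon^m| / |S_\epsilon^m|$, with $|S_\epsilon^m|$ the cardinality
of $S_\epsilon^m$.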



Loosely speaking the uniform distribution is hence defined as the limit of a discrete uniform distribution on an
equally spaced grid, where the notion of "equally spaced" is determined by the distance metric underlying $(M, d)$.
Even though this definition is intuitive it is not constructive. Apart from special cases, generating an
$\epsilon$-lattice is computationally difficult in a general metric space; calculating the limit of
$\epsilon$-lattices even more so. In addition it is unclear whether there is just one limit distribution that all
$\epsilon$-lattices would converge to. To overcome these problems Dembski (1990) uses the closely related notion of
packing numbers. The packing number $D(\epsilon, A, d)$ of a subset $A \subset M$ in the metric $d$ is defined as the
cardinality of an $\epsilon$-lattice on $A$, and packing numbers are known for a number of metric spaces. An
$\epsilon$-pseudo-probability can then be defined as $P_\epsilon(A) = D(\epsilon, A, d) / D(\epsilon, M, d)$. It is
straightforward to see that $0 \leq P_\epsilon(A) \leq 1$ and that $P_\epsilon(M) = 1$, but packing numbers are
sub-additive and hence $P_\epsilon$ is not a probability measure. However, for disjoint sets $A'$ and $A''$ with minimum
distance $> \epsilon$ additivity holds, i.e. $P_\epsilon(A' \cup A'') = P_\epsilon(A') + P_\epsilon(A'')$.
Dembski (1990) then shows that whenever $\lim_{\epsilon \to 0} P_\epsilon(A)$ exists for any $A$, then the limit
distribution is the unique uniform distribution on $(M, d)$ (see Dembski (1990) or Ghosal et al. (1997) for details).
As packing numbers are known for a number of metric spaces, this result provides a constructive way for building
uniform distributions, without the need for explicitly constructing $\epsilon$-lattices.

Subsequently we consider the practically important case of a finite number of parameters $p$ and assume that the
metric of $(M, d)$, $d(\phi, \phi_0) = d(\varphi(\theta), \varphi(\theta_0)) = d^*(\theta, \theta_0)$ in terms of
$\theta$, can be approximated by a local quadratic approximation of the form

$$d^*(\theta, \theta_0) = c_1 \sqrt{c_2 (\theta - \theta_0)^T V(\theta_0) (\theta - \theta_0) + O(\|\theta - \theta_0\|^k)} \qquad (1)$$

where $c_1, c_2 > 0$ are constants and $k \geq 3$. Equation (1) implies that $d^*(\cdot, \cdot)$ can locally be
approximated by a Euclidean metric. This is not a very strong condition: for a sufficiently often differentiable metric
$d(\cdot, \theta_0)$ one can make use of a second-order Taylor expansion of $d(\cdot, \theta_0)^2$ and apply the square
root to obtain (1). The following theorem calculates the distribution induced on $\theta$ by imposing a uniform
distribution in $(M, d)$, when assumption (1) holds. The proof is only a slight adaptation of earlier results by
Ghosal et al. (1997), see Appendix A.



Theorem 1 For a metric space $(M, d)$ and a bijective function $\varphi$, fulfilling (1), where $V(\theta)$ is a
symmetric matrix with finite strictly positive eigenvalues for all $\theta \in \Theta$ and continuous as a function of
$\theta$, $P_\epsilon(A) = D(\epsilon, A, d^*) / D(\epsilon, \Theta, d^*)$ for $A \subset \Theta$ converges to

$$\frac{\int_A \sqrt{\det(V(\theta))} \, d\theta}{\int_\Theta \sqrt{\det(V(\theta))} \, d\theta}, \quad \text{as } \epsilon \to 0.$$

The density of the uniform probability distribution is hence given by:

$$p(\theta) = \frac{\sqrt{\det(V(\theta))}}{\int_\Theta \sqrt{\det(V(\theta))} \, d\theta}.$$

We note that the last result can be obtained as well by using considerations based on Riemannian manifolds, in which
case (1) would be the Riemannian metric: For example Pennec (2006) explicitly considers uniform distributions on
Riemannian manifolds and obtains the same result. We concentrated on Dembski's derivation as it seems both more
general and intuitive.

It is important to note that the distribution defined in this way is independent of the parametrization. This is
intuitively clear, as the space $(M, d)$, where the uniform distribution is imposed, is fixed, no matter which
parametrization is used. We illustrate this invariance property for the special case of a Taylor approximation in the
Theorem below; for a proof see Appendix B.



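Theorem 2 Assume $(M, d)$ with
$d(\theta, \theta_0)^2 = \frac{1}{2} (\theta - \theta_0)' V(\theta_0) (\theta - \theta_0) + O(\|\theta - \theta_0\|^3)$,
where $V(\theta_0)$ is the matrix of second partial derivatives of $d(\theta, \theta_0)^2$ with respect to $\theta$,
evaluated at $\theta = \theta_0$, which leads to a prior $p(\theta) \propto \sqrt{\det(V(\theta))}$.

When calculating the uniform distribution associated to the transformed parameter $g(\theta) = \gamma$, with $g$ a
bijective twice differentiable transformation, one obtains
$p(\gamma) \propto \sqrt{\det(H(\gamma))^2 \det(V(h(\gamma)))}$, where $h$ is the inverse of $g$ and
$H(\gamma) = (\frac{\partial}{\partial \gamma_1} h(\gamma), \ldots, \frac{\partial}{\partial \gamma_p} h(\gamma))$ is
the Jacobian matrix associated with $h$, which is the same result as applying the change of variables theorem to
$p(\theta)$.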

A technical restriction of the theory described in this section is the concentration on compact spaces $\Theta$.
However, it is possible to extend this based on taking limits of a sequence of growing compact spaces, see the works
of Dembski (1990) and Ghosal et al. (1997) for details. Note that the resulting limiting density does not need to be
integrable.

2.1.1 Examples 
Non-functional uniform priors 

While the approach outlined in Section 2.1 is developed for general metric spaces, it coincides with standard results
about change of variables, when the metric space $M$ is a compact subset of $\mathbb{R}^p$ as well. Suppose one would
like to use a uniform distribution for $\varphi(\theta)$, with $\varphi: \mathbb{R}^p \to \mathbb{R}^p$ a bijective,
continuously differentiable function, and then back-transform to the $\theta$ scale. Using the standard change of
variables theorem one obtains $p(\theta) \propto |\det(D(\theta))|$, where
$D(\theta) = (\frac{\partial}{\partial \theta_1} \varphi(\theta), \ldots, \frac{\partial}{\partial \theta_p} \varphi(\theta))$
is the Jacobian matrix of the transformation $\varphi(\theta)$. Framed in the approach of the last section, the metric
space $M$ is a compact subset of $\mathbb{R}^p$ with the Euclidean metric in the transformed space,
$d(\varphi(\theta), \varphi(\theta_0)) = \sqrt{(\varphi(\theta) - \varphi(\theta_0))^T (\varphi(\theta) - \varphi(\theta_0))}$.
A local linear approximation to $\varphi(\theta) - \varphi(\theta_0)$ is $D(\theta_0)(\theta - \theta_0)$ with remainder
$O(\|\theta - \theta_0\|^2)$. Hence, one obtains

$$d(\varphi(\theta), \varphi(\theta_0)) = \sqrt{(\theta - \theta_0)^T D(\theta_0)^T D(\theta_0) (\theta - \theta_0) + O(\|\theta - \theta_0\|^3)}.$$

Applying Theorem 1 one ends up with the desired distribution
$\propto \sqrt{\det(D(\theta)^T D(\theta))} = |\det(D(\theta))|$.
Jeffreys Prior 



Another special case of this general approach is Jeffreys prior itself. Jeffreys (1961) described his rule by noting
that (1) approximates the empirical Hellinger distance (as well as the empirical Kullback-Leibler divergence) between
the residual distributions in a statistical model, when $V(\theta)$ is the Fisher information matrix. In this situation
the parameters of a statistical model are mapped into the space of residual densities and this space is used to define
the notion of distance between the $\theta$'s. Applying the machinery from the last section then leads to a uniform
distribution on the space of residual densities. This interpretation of Jeffreys rule is rare, but has been noted,
among others, by Kass and Wasserman (1996, ch. 3.6). Ghosal et al. (1997) and Balasubramanian (1997) explicitly derive
Jeffreys rule from these principles. From this viewpoint Jeffreys prior is hence useful as a universal "default" prior,
because it gives equal weights to all possible residual densities underlying a statistical model. However, the metric
used can depend, for example, on values of covariates, which is undesirable in the nonlinear regression application,
as discussed in the introduction.
Triangular Distribution 

In this example Definition 1 is directly used to numerically approximate a uniform distribution on a metric space.
This can be done in the case $p = 1$, where the construction of $\epsilon$-lattices is easily possible numerically.
The triangular distribution, with density

$$p(x|\theta) = \begin{cases} 2x/\theta, & 0 \leq x \leq \theta, \\ 2(1 - x)/(1 - \theta), & \theta < x \leq 1, \end{cases} \qquad (2)$$

for $\theta \in (0, 1)$ is a simple, yet versatile distribution, for which the Jeffreys prior does not exist
(Berger et al., 2009). One possible metric space in which to impose the uniform distribution is the space of triangular
densities or triangular distribution functions parametrized by $\theta$. Several metrics might be used; we will
consider the Hellinger metric
$d_H(\theta_1, \theta_2) = \left( \int_0^1 (\sqrt{p(x|\theta_1)} - \sqrt{p(x|\theta_2)})^2 \, dx \right)^{1/2}$ and the
Kolmogorov metric
$d_K(\theta_1, \theta_2) = \sup_{y \in [0, 1]} \left| \int_0^y (p(x|\theta_1) - p(x|\theta_2)) \, dx \right|$.
Numerically calculating the corresponding $\epsilon$-lattices one obtains the distributions displayed in Figure 2.
Interestingly one can observe that the calculated uniform distribution in the Hellinger metric space is equal to a
Beta(1/2, 1/2) distribution (as is the reference prior in Berger et al. (2009)), while the calculated functional
uniform distribution in the Kolmogorov metric results in a uniform distribution on $[0, 1]$.
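The numerical construction for the Kolmogorov metric can be sketched as follows. This is a minimal illustration in Python and not the code used for Figure 2: the function names, the grid sizes and the greedy left-to-right scan are assumptions made here, and the scan only yields a maximal packing because the parameter is one-dimensional. The triangular distribution is taken to live on $[0, 1]$ with mode $\theta$.

```python
def tri_cdf(x, theta):
    """CDF of the triangular distribution on [0, 1] with mode theta."""
    if x <= theta:
        return x * x / theta if theta > 0 else 0.0
    return 1.0 - (1.0 - x) ** 2 / (1.0 - theta)

# Grid of x values used to approximate the supremum (resolution is an arbitrary choice).
X_GRID = [i / 400 for i in range(401)]

def kolmogorov_distance(theta1, theta2):
    """Approximate sup-distance between two triangular CDFs on the x grid."""
    return max(abs(tri_cdf(x, theta1) - tri_cdf(x, theta2)) for x in X_GRID)

def greedy_lattice(eps, n_theta=1000):
    """Scan theta from 0 to 1 and keep a point once it is at least eps away from the last kept point."""
    points = [0.0]
    for j in range(1, n_theta + 1):
        theta = j / n_theta
        if kolmogorov_distance(theta, points[-1]) >= eps:
            points.append(theta)
    return points

lattice = greedy_lattice(0.05)
gaps = [b - a for a, b in zip(lattice, lattice[1:])]
print(len(lattice), min(gaps), max(gaps))
```

In this sketch the kept points come out (up to grid resolution) equally spaced in $\theta$, which is in line with the uniform distribution on $[0, 1]$ reported for the Kolmogorov metric. Swapping in a Hellinger distance would require only a different distance function.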



2.2 Nonlinear Regression 

The implicit assumption when employing a nonlinear regression function $\mu(x, \theta)$ is that for one $\theta$ the
shape of the function $\mu(x, \theta)$ will adequately describe reality. It is usually unclear, however, which of these
shapes is the right one. A uniform distribution on the functional shapes hence seems to be a reasonable prior. A
suitable metric space is consequently the space of functions $\mu(\cdot, \theta)$, with
$x \in X \subset \mathbb{R}$, $\theta \in \Theta \subset \mathbb{R}^p$ with compact $\Theta$ and metric for example
given by the $L_2$ distance
$d(\theta, \theta_0) = \sqrt{\int_X (\mu(x, \theta) - \mu(x, \theta_0))^2 \, dx}$. By a first order Taylor expansion
one obtains $\mu(x, \theta) - \mu(x, \theta_0) = J_x(\theta_0)(\theta - \theta_0) + O(\|\theta - \theta_0\|^2)$, where
$J_x(\theta_0) = \frac{\partial}{\partial \theta} \mu(x, \theta)$, evaluated at $\theta_0$, is the row vector of first
partial derivatives. This results in an approximation of the form
$(\mu(x, \theta) - \mu(x, \theta_0))^2 = (\theta - \theta_0)^T J_x(\theta_0)^T J_x(\theta_0) (\theta - \theta_0) + O(\|\theta - \theta_0\|^3)$.
Integrating this with respect to $x$ and taking the square root leads to an approximation of $d(\theta, \theta_0)$ of
the form $\sqrt{(\theta - \theta_0)^T Z^*(\theta_0) (\theta - \theta_0) + O(\|\theta - \theta_0\|^3)}$, where
$Z^*(\theta) = \int_X J_x(\theta)^T J_x(\theta) \, dx$. Consequently, from Theorem 1 the functional uniform distribution
for $\theta$ equals $p(\theta) \propto \sqrt{\det(Z^*(\theta))}$. In the special case of a linear model
$\mu(x, \theta) = f(x)^T \theta$, the functional uniform distribution collapses to a constant prior distribution, which
is the uniform distribution on $\Theta$ for compact $\Theta$ and improper when extending to non-compact $\Theta$.

Figure 2: Numerically calculated uniform distributions in the (i) Hellinger metric and (ii) the Kolmogorov metric.
The solid curves are based on interpolation of the empirical distribution functions of 0.005-lattices followed by
differentiation, the dots represent an 0.03-lattice.

We now revisit the exponential regression example from the introduction. In this case one obtains
$J_x(\theta) = -x \exp(-\theta x)$; calculating $\int_0^{10} J_x(\theta)^2 \, dx$ and applying the square root, one
obtains

$$p(\theta) \propto \exp(-10\theta) \sqrt{\frac{\exp(20\theta) - 200\theta^2 - 20\theta - 1}{\theta^3}}.$$

Normalizing this leads to the prior displayed in Figure 3 (i). On the $\theta$ scale the shape based functional uniform
density hence leads to a rather non-uniform distribution. In Figure 3 (ii) one can observe that the probability mass is
distributed uniformly over the different shapes, as desired.
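The closed-form density for the exponential model can be cross-checked against a direct numerical evaluation of $\sqrt{Z^*(\theta)}$. The sketch below is an illustration only; the function names and the use of Simpson's rule with 2000 intervals are choices made here, not taken from the article. Since the prior is only defined up to a constant, the check is that the ratio of the two quantities does not depend on $\theta$.

```python
import math

def z_star_numeric(theta, upper=10.0, n=2000):
    """Simpson approximation of the integral of (x * exp(-theta * x))**2 over [0, upper]."""
    h = upper / n
    total = 0.0
    for i in range(n + 1):
        x = i * h
        weight = 1 if i in (0, n) else (4 if i % 2 else 2)
        total += weight * (x * math.exp(-theta * x)) ** 2
    return total * h / 3.0

def prior_closed_form(theta):
    """Unnormalised density exp(-10 t) * sqrt((exp(20 t) - 200 t^2 - 20 t - 1) / t^3)."""
    inner = math.exp(20 * theta) - 200 * theta ** 2 - 20 * theta - 1
    return math.exp(-10 * theta) * math.sqrt(inner / theta ** 3)

# The ratio should be the same constant for every theta if both expressions agree.
ratios = [prior_closed_form(t) / math.sqrt(z_star_numeric(t)) for t in (0.5, 1.0, 2.0, 5.0)]
print(ratios)
```

Working with the unnormalised density is sufficient for most sampling-based posterior computations; the normalising constant on $[0, 5]$ would only be needed for plotting on an absolute scale.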

An advantage of the functional uniform prior over the uniform prior is that it is independent of the choice of
parameterization and not particularly sensitive to the potential choice of the bounds, provided all major shapes of
the underlying function are covered. In Figure 3 (i) one can see that the density is already rather small at
$\theta = 5$, as most of the underlying functional shapes are already covered. In fact, in this example one can extend
the functional uniform distribution from the compact interval $[0, 5]$ to a proper distribution on $[0, \infty)$.

Figure 3: (i) Display of the functional uniform distribution on the $\theta$ scale; (ii) Display of the regression
function $\exp(-\theta x)$ for $\theta = 0$, $\theta = 5$ and the $\theta$ corresponding to the $i/10$ quantile,
$i = 1, \ldots, 9$, of the functional uniform distribution.
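The independence of the parametrization can be illustrated numerically for the exponential model. In the following sketch, which is an illustration under assumptions made here (the reparametrization $\gamma = \exp(-\theta)$, the function names and the Simpson rule are not from the article), the unnormalised functional uniform density is computed once directly in the $\gamma$ parametrization, where the regression function reads $\gamma^x$, and once by applying the change of variables theorem to the density on the $\theta$ scale.

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson rule with n (even) intervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0

def prior_theta(theta):
    """sqrt(Z*) for mu(x, theta) = exp(-theta * x), x in [0, 10]."""
    return math.sqrt(simpson(lambda x: (x * math.exp(-theta * x)) ** 2, 0.0, 10.0))

def prior_gamma_direct(gamma):
    """sqrt(Z*) for the same curves written as mu(x, gamma) = gamma**x."""
    return math.sqrt(simpson(lambda x: (x * gamma ** (x - 1.0)) ** 2, 0.0, 10.0))

def prior_gamma_transformed(gamma):
    """Change of variables applied to prior_theta, with theta = -log(gamma) and |d theta / d gamma| = 1 / gamma."""
    return prior_theta(-math.log(gamma)) / gamma

for gamma in (0.1, 0.5, 0.9):
    print(gamma, prior_gamma_direct(gamma), prior_gamma_transformed(gamma))
```

Both routes give the same unnormalised density, whereas a uniform prior on $\gamma$ would not correspond to a uniform prior on $\theta$.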

Although the choice of the $L_2$ metric for $d$ seems reasonable in a variety of situations, other choices are possible.
One could for example use a weighted version of the $L_2$ distance, when interest is in particular regions of the
design space $X$. In fact Jeffreys prior can be identified as a special case, when the assumed residual model is given
by a homoscedastic normal distribution. In this situation the empirical measure on the design points is used as a
weighting measure. The Jeffreys prior has also been mentioned by Bates and Watts (1988, p. 217) as a prior that is
uniform on the response surfaces, but the possibility of alternative weighting measures has not been considered.

One potential obstacle in the use of the proposed functional uniform prior is the fact that it can be computationally
challenging to calculate. In some situations it might be possible to calculate $Z^*(\theta)$ analytically; in others
one might need to use numerical integration to approximate the underlying integrals. However, it needs to be noted that
the prior only needs to be calculated once, as the prior is independent of the observed data (it only depends on the
design region $X$ and on potential parameter bounds), and can then be approximated for example in terms of more
commonly used distributions. This approximation can then be reused in different modelling situations.



3 Applications 



In this section, we will evaluate the proposed functional uniform priors for nonlinear regression. One application of
nonlinear regression is in the context of pharmaceutical dose-finding trials. A challenge in these trials is that the
variability in the response is usually large and the number of doses used fairly small, so that the underlying
inference problem is challenging, despite an often seemingly large sample size. The priors will first be tested in a
real example, then the frequentist operating characteristics of the proposed functional uniform priors are assessed
more formally in a simulation study for a binary endpoint. In the last example we will use the functional uniform
distribution for calculation of a Bayesian optimal design in the exponential regression example.



3.1 Irritable Bowel Syndrome Dose-Response Study 



Here the IBScovars data set taken from the DoseFinding package will be used (Bornkamp et al., 2010). The data
were part of a dose ranging trial on a compound for the treatment of the irritable bowel syndrome with four active
doses 1, 2, 3, 4 equally distributed in the dose range $[0, 4]$ and placebo. The primary endpoint was a baseline
adjusted abdominal pain score with larger values corresponding to a better treatment effect. In total 369 patients
completed the study, with nearly balanced allocation across the doses. Assume a normal distribution is used to
model the residual error and that the hyperbolic Emax model
$\mu(x, \theta) = \theta_0 + \theta_1 x / (\theta_2 + x)$ was chosen to describe the dose-response relationship. The
parameters $\theta_0$ and $\theta_1$ determine the placebo mean and the asymptotic maximum effect, while the parameter
$\theta_2$ determines the dose that gives 50 percent of the asymptotic maximum effect, so that it determines the
steepness of the curve. In clinical practice vague prior information typically exists for $\theta_0$ and $\theta_1$,
but for illustration here we use improper constant priors for these two parameters and a prior proportional to [...]
for $\sigma$. For the nonlinear parameter $\theta_2$ we will use a uniform prior and the functional uniform prior
distribution. When using a uniform distribution for $\theta_2$ it is necessary to assume bounds, as otherwise an
improper posterior distribution may arise. We will use the bounds $[0.004, 6]$ here; the selection of the boundaries is
based on the fact that practically all of the shapes of the underlying model are covered, taking into account that the
dose range is $[0, 4]$. For comparability the same bounds were used for the functional uniform prior, although one can
extend it to an integrable density on $[0, \infty)$. The functional uniform prior will be used based on the function
space defined by $x / (\theta_2 + x)$. Performing the calculations described in Section 2.2 one obtains
$J_x(\theta_2)^2 = x^2 / (x + \theta_2)^4$; calculating the integral $\int_0^4 J_x(\theta_2)^2 \, dx$ and applying the
square root leads to

$$p(\theta_2) \propto \frac{1}{\sqrt{\theta_2^4 + 12\theta_2^3 + 48\theta_2^2 + 64\theta_2}} \, 1_{[0.004, 6]}(\theta_2).$$

Similar to the exponential regression example in the introduction, a uniform distribution on the $\theta_2$ space
induces an informative distribution in the space of functional shapes. Larger values of $\theta_2$ (say $> 3$)
correspond to almost linear shapes, while only very small values of $\theta_2$ lead to more pronounced concave shapes.
A uniform prior on $\theta_2$ hence induces a prior that favors linear shapes over steeply increasing model shapes.
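How strongly the two priors differ in the weight they give to the almost linear shapes can be quantified directly. The sketch below is an illustration only: the function names, the midpoint rule and the cut-off at 3 (taken from the "say > 3" remark above) are choices made here. It compares the prior probability of the nonlinear Emax parameter exceeding 3 under the uniform prior and under the functional uniform prior on $[0.004, 6]$.

```python
import math

LOWER, UPPER = 0.004, 6.0  # parameter bounds used in the example

def func_unif_density(theta2):
    """Unnormalised functional uniform density for the Emax shape x / (theta2 + x) on the dose range [0, 4]."""
    return 1.0 / math.sqrt(theta2 ** 4 + 12 * theta2 ** 3 + 48 * theta2 ** 2 + 64 * theta2)

def integrate(f, a, b, n=20000):
    """Midpoint rule on [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

def prob_above(cut):
    """Prior probability that theta2 exceeds cut under the functional uniform prior."""
    return integrate(func_unif_density, cut, UPPER) / integrate(func_unif_density, LOWER, UPPER)

p_func = prob_above(3.0)
p_unif = (UPPER - 3.0) / (UPPER - LOWER)
print(f"P(theta2 > 3): uniform prior {p_unif:.2f}, functional uniform prior {p_func:.2f}")
```

Under these assumptions the uniform prior places about half of its mass on values above 3, while the functional uniform prior places only about 0.16 there.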

We used importance sampling resampling based on a proposal distribution generated by the iterated Laplace
approximation to implement the model (see Bornkamp (2011)). In Figure 4 one can observe the posterior uncertainty
intervals under the two prior distributions. As is visible, the bias towards linear shapes, when using a uniform
distribution for $\theta_2$, persists in the posterior distribution. This happens despite the rather large sample size,
and despite the fact that the responses at the doses 0, 4 and particularly at dose 1 are not very well fitted by a
linear shape. So the posterior seems to be rather sensitive to the uniform prior distribution. The posterior based on
the shape based functional uniform prior, in contrast, fits the data better at all doses, and seems to provide a more
realistic measure of uncertainty for the dose-response curve, particularly for $x \in (0, 1)$.

Figure 4: Posterior of the dose-response curve under a uniform and a functional uniform prior for $\theta_2$.

3.2 Simulations 

One might expect that the functional uniform prior distribution works acceptably no matter which functional shape
is the true one. To investigate this in more detail and to compare this prior to other prior distributions in terms of
their frequentist performance, simulation studies have been conducted. Here we report results from simulations in
the context of binary nonlinear regression.

For simulation the power model $\mu(x, \theta) = \theta_0 + \theta_1 x^{\theta_2}$ will be used to model the response
probability depending on $x$. The parameters $\theta_0$ and $\theta_1$ are hence subject to
$\theta_0 + \theta_1 \leq 1$ and $\theta_0, \theta_1 \geq 0$, as a probability is modelled. Note that only $\theta_2$
enters the model function non-linearly.

The doses 0, 0.05, 0.2, 0.6, 1 are to be used with equal allocations of 20 patients per dose. We use four scenarios in
this case: in the first three cases the power model is used with $\theta_0 = 0.2$ and $\theta_1 = 0.6$, while
$\theta_2$ is equal to 0.4 (Power 1), 1 (Linear) and 4 (Power 2). In addition we provide one scenario, where an Emax
model $0.2 + 0.6x / (x + 0.05)$ is the truth. The Emax scenario is added to investigate the behaviour under
misspecification of the model. Each simulation scenario will be repeated 1000 times.
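The data-generating step of such a simulation can be sketched as follows. This is a minimal illustration in Python and not the code used for the reported results (which were obtained in R); the function names and the seed are arbitrary choices made here, and only the three power model scenarios are covered.

```python
import random

DOSES = [0.0, 0.05, 0.2, 0.6, 1.0]
N_PER_DOSE = 20

def power_model(x, theta0, theta1, theta2):
    """Response probability theta0 + theta1 * x**theta2."""
    return theta0 + theta1 * x ** theta2

def simulate_trial(theta2, rng, theta0=0.2, theta1=0.6):
    """Return the number of responders per dose for one simulated trial."""
    counts = []
    for dose in DOSES:
        prob = power_model(dose, theta0, theta1, theta2)
        counts.append(sum(rng.random() < prob for _ in range(N_PER_DOSE)))
    return counts

rng = random.Random(2012)  # arbitrary seed, only for reproducibility of the sketch
scenarios = {"Power 1": 0.4, "Linear": 1.0, "Power 2": 4.0}
one_trial = {name: simulate_trial(theta2, rng) for name, theta2 in scenarios.items()}
print(one_trial)
```

Repeating `simulate_trial` 1000 times per scenario and fitting the model under each prior would give the raw material for the summary measures reported in the tables.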
Prior         Model      MAE1    MAE2    CP      ILE
Uniform       Linear     0.082   0.062   0.819   0.259
              Power 1    0.079   0.056   0.816   0.255
              Power 2    0.066   0.067   0.881   0.220
              Emax       0.073   0.056   0.780   0.226
Jeffreys      Linear     0.058   0.065   0.900   0.233
              Power 1    0.054   0.056   0.901   0.220
              Power 2    0.056   0.073   0.895   0.227
              Emax       0.056   0.055   0.845   0.196
Func. Unif.   Linear     0.060   0.060   0.892   0.240
              Power 1    0.057   0.053   0.893   0.226
              Power 2    0.057   0.070   0.912   0.240
              Emax       0.059   0.054   0.834   0.203

Table 1: Estimation of dose-response; MAE1 and MAE2 correspond to the dose-response estimation error (for the
posterior median and mode), CP denotes the average coverage probability of pointwise 0.9 credibility intervals and
ILE denotes the average credibility interval lengths.



We will compare the functional uniform prior distribution to the uniform distribution on the parameters and to the
Jeffreys prior distribution. For the uniform prior distribution approach uniform prior distributions were assumed
for all parameters, and the nonlinear parameter $\theta_2$ was assumed to be within $[0.05, 20]$ to ensure
integrability. The same bounds are used for the two other approaches for comparability. For the functional uniform
prior approach, uniform priors are used for $\theta_0$ and $\theta_1$, while for $\theta_2$ the functional uniform
prior will be used on the function space defined by $x^{\theta_2}$. The prior can be calculated to be
$p(\theta_2) \propto 1 / \sqrt{(2\theta_2 + 1)^3}$. For the Jeffreys prior approach we used a prior proportional to
$\sqrt{\det(I(\theta))}$, within the imposed parameter bounds. For analysis we used MCMC based on the HITRO algorithm,
which is an MCMC sampler that combines the hit and run algorithm with the ratio of uniforms transformation. It does not
need tuning and is hence well suited for a simulation study. The sampler is implemented in the Runuran package
(Leydold and Hörmann, 2010); computations were performed with R (R Development Core Team, 2011). 10000 MCMC samples are
used from the corresponding posterior distributions in each case, using a burn-in phase of 1000 and a thinning of 2.

In Table[l]one can observe the estimation results in terms of the mean absolute estimation error for the dose-response 
function, MAE = 1/9 Y^i=o where ^(.) is the underlying true function and is either the point wise 

posterior median (corresponding to MAEi) or the prediction corresponding to the posterior mode for the parameters 
(MAE2), the posterior mode for the uniform prior is equal to the maximum likelihood estimate. The values displayed 
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Uniform Prior 


Functional Uniform Prior 


Scenario 


iV 


MAEi 


MAE2 


CP 


ILE 


MAEi 


MAEa 


CP 


ILE 


Sig. Emax 1 


125 


0.256 


0.277 


0.903 


1.098 


0.230 


0.270 


0.914 


1.028 


Sig. Emax 2 




0.278 


0.283 


0.895 


1.144 


0.258 


0.278 


0.909 


1.089 


Sig. Emax 3 




0.243 


0.275 


0.902 


1.014 


0.251 


0.262 


0.898 


1.030 


Linear 




0.266 


0.291 


0.901 


1.100 


0.241 


0.289 


0.918 


1.057 


Quadratic 




0.272 


0.278 


0.880 


1.109 


0.242 


0.276 


0.898 


1.038 


Sig. Emax 1 


250 


0.185 


0.214 


0.908 


0.818 


0.167 


0.209 


0.920 


0.768 


Sig. Emax 2 




0.196 


0.206 


0.908 


0.850 


0.187 


0.201 


0.910 


0.811 


Sig. Emax 3 




0.174 


0.202 


0.913 


0.738 


0.170 


0.188 


0.912 


0.744 


Linear 




0.200 


0.209 


0.891 


0.831 


0.189 


0.211 


0.900 


0.794 


Quadratic 




0.201 


0.215 


0.881 


0.839 


0.185 


0.216 


0.886 


0.782 



Table 2: Estimation of dose-response; MAEi and MAE2 correspond to the estimation error at the doses 0,1,...,8 
(for the posterior median and the posterior mode), CP denotes the average coverage probability of pointwise 0.9 
credibility intervals and ILE denotes the average credibility interval lengths. 

in Table 1 are the average MAE over 1000 repetitions. In addition, for each simulation the 0.9 credibility intervals at
the dose-levels 0, 1/8, 2/8, ..., 1 have been calculated. The number given in the table is
$CP = \frac{1}{9}\sum_{d=0}^{8} P_{d/8}$, where $P_d$ is the average coverage probability of the 0.9 credibility
interval at dose $d$ over 1000 simulation runs. In addition the average length of the credibility intervals has been
calculated as $ILE = \frac{1}{9}\sum_{d=0}^{8} L_{d/8}$, where $L_d$ is the average length of the 0.9 credibility
interval at dose $d$ over 1000 simulation runs. For estimation of the dose-response, Jeffreys prior and the functional
uniform prior improve upon the uniform prior distribution, while the Jeffreys prior and the functional uniform prior
are close, with slight advantages for the Jeffreys prior. In terms of the credibility intervals, the functional uniform
and Jeffreys prior roughly keep their nominal level for the linear and the power model cases, while the uniform prior
does not. None of the priors achieves the nominal level for the Emax model, which is probably due to the fact that the
Emax model is too different from the power model. Interestingly, the credibility intervals of the uniform prior are
larger than those of the other two priors, but lead to a smaller coverage probability.
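The two interval summaries can be computed along the following lines. This is only a minimal sketch: the function name `coverage_and_length`, the array layout (one row per simulation run, one column per dose) and the toy numbers are assumptions made for illustration and are not taken from the simulation study itself.

```python
import numpy as np

def coverage_and_length(lower, upper, truth):
    """Summarise pointwise credibility intervals over simulation runs.

    lower, upper: arrays of shape (n_sim, n_dose) with the interval bounds.
    truth: array of shape (n_dose,) with the true dose-response values.
    Returns (CP, ILE): coverage and interval length, averaged first over
    simulation runs (giving P_d and L_d) and then over the doses.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    truth = np.asarray(truth, dtype=float)
    covered = (lower <= truth) & (truth <= upper)  # shape (n_sim, n_dose)
    p_d = covered.mean(axis=0)                     # coverage per dose
    l_d = (upper - lower).mean(axis=0)             # mean length per dose
    return p_d.mean(), l_d.mean()

# Toy illustration with made-up numbers (NOT the paper's simulation output).
doses = np.arange(9) / 8.0
truth = doses                                  # hypothetical linear curve
lower = np.vstack([truth - 0.5, truth + 0.1])  # second run misses every dose
upper = np.vstack([truth + 0.5, truth + 0.9])
cp, ile = coverage_and_length(lower, upper, truth)
print(cp, ile)
```

Averaging over runs before averaging over doses mirrors the definitions of $P_d$ and $L_d$; because every dose has the same number of runs, the order of averaging does not change the result.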

Table 2 provides the estimation results with respect to parameter estimation. The main message here is that all
priors perform roughly equally for estimation of the linear parameters $\theta_0$ and $\theta_1$. For the nonlinear
parameters, Jeffreys prior distribution and the functional uniform prior perform better than the uniform distribution.

In summary, the functional uniform prior performs roughly as well as the Jeffreys prior in these simulations.
However, the functional uniform prior has the pragmatic and conceptual advantages that it does not depend on the
observed covariates, and can thus be used, for example, for calculation of a Bayesian optimal design, or in sequential
situations.
Figure 5: Efficiency of the two designs for different shapes.
3.3 Bayesian optimal design for exponential regression 

In this section we will use the prior distribution for the exponential regression model derived in Section 2.2 to
calculate a Bayesian optimal design. When assuming a homoscedastic normal model, the Fisher information is
$I(d, \theta) \propto \sum_i w_i x_i^2 \exp(-2\theta x_i)$. Hence minimizing $-\log(I(d, \theta))$ will lead to a
design with most information. Unfortunately the expression depends on $\theta$, which is of course unknown before the
experiment. One way of dealing with this uncertainty is to use Bayesian optimal designs, where one optimizes the design
criterion averaged with respect to a prior distribution: $-\int \log(I(d, \theta)) p(\theta) d\theta$. In this
situation we will use the uniform and functional uniform prior distribution (see Figures 1 and 3), both on the interval
[0, 5], for calculation of the optimal design. Restricting the design space to $x \in [0, 10]$ and only performing the
optimization up to 5 design points, one ends up with the weights w = (0.956, 0.022, 0.022) on the design points
x = (0.38, 4.04, 10) for the uniform prior, while the functional uniform prior distribution leads to a design of the
form w = (0.19, 0.3, 0.51) and x = (0.54, 2.35, 10). The design corresponding to the functional uniform prior hence
spreads its allocation weights more uniformly over the design range, whereas the uniform prior results in essentially
one major design point.
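The averaged criterion can be evaluated numerically for any candidate design. The sketch below does this on a grid over [0, 5] for the two reported designs. All function names are hypothetical. The functional uniform density used here is an assumption as well: it is taken proportional to the square root of $\int_0^{10} x^2 \exp(-2\theta x) dx$, which corresponds to an $L_2$ distance between curves on the design range, and it may differ from the exact form derived in Section 2.2.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal rule along the last axis of y."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    return np.sum((y[..., 1:] + y[..., :-1]) * np.diff(x) / 2.0, axis=-1)

def fisher_info(w, x, theta):
    """I(d, theta) up to a constant, evaluated for a vector of theta values."""
    w, x = np.asarray(w, float), np.asarray(x, float)
    theta = np.atleast_1d(np.asarray(theta, float))
    return np.sum(w * x**2 * np.exp(-2.0 * theta[:, None] * x), axis=1)

def bayes_criterion(w, x, theta_grid, prior):
    """Grid approximation of -int log I(d, theta) p(theta) dtheta."""
    p = prior(theta_grid)
    p = p / trapezoid(p, theta_grid)  # normalise the density on the grid
    return -trapezoid(np.log(fisher_info(w, x, theta_grid)) * p, theta_grid)

def uniform_prior(theta):
    return np.ones_like(theta)

def functional_uniform_prior(theta, x_max=10.0, n=2001):
    """ASSUMPTION: density proportional to sqrt(int_0^x_max x^2 exp(-2 theta x) dx)."""
    xs = np.linspace(0.0, x_max, n)
    integrand = xs**2 * np.exp(-2.0 * theta[:, None] * xs)
    return np.sqrt(trapezoid(integrand, xs))

theta_grid = np.linspace(0.0, 5.0, 2001)
design_unif = ((0.956, 0.022, 0.022), (0.38, 4.04, 10.0))  # reported for the uniform prior
design_func = ((0.19, 0.30, 0.51), (0.54, 2.35, 10.0))     # reported for the functional uniform prior

for prior_name, prior in [("uniform", uniform_prior),
                          ("functional uniform (assumed form)", functional_uniform_prior)]:
    for design_name, (w, x) in [("uniform-prior design", design_unif),
                                ("functional-uniform design", design_func)]:
        value = bayes_criterion(w, x, theta_grid, prior)
        print(f"{prior_name:35s} {design_name:27s} {value:.3f}")
```

Smaller values of the criterion are better. The sketch only evaluates given designs; finding the optimal weights and design points would require an additional numerical optimizer.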

One way of comparing the two calculated designs is to look at the efficiency
$\mathrm{Eff}(d, \theta) = \exp(\log(I(d, \theta)) - \log(I(d_{opt}(\theta), \theta)))$ of the calculated designs,
with respect to the design $d_{opt}(\theta)$ that is locally optimal for the parameter value $\theta$, for a
range of different shapes. In Figure 5 we plot the efficiency for the different shapes on the functional shape scale.
One can observe that the uniform prior design is only efficient for the sharply decreasing shapes with $\theta > 0.71$,
but otherwise has very low efficiency. The functional uniform prior improves quite a bit over the uniform prior
distribution for most of the functional shape space, and provides at least a reasonable efficiency for most shapes.
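For this one-parameter model the efficiency can be computed in closed form. The information is linear in the weights, so a locally optimal design puts all weight on the point maximizing $x^2 \exp(-2\theta x)$ on [0, 10], that is $x = 1/\theta$ capped at 10. The following sketch uses this fact; the function names are hypothetical and the $\theta$ values are chosen only for illustration.

```python
import numpy as np

def fisher_info(w, x, theta):
    """I(d, theta) up to a constant for a single theta value."""
    w, x = np.asarray(w, float), np.asarray(x, float)
    return float(np.sum(w * x**2 * np.exp(-2.0 * theta * x)))

def locally_optimal_point(theta, x_max=10.0):
    """Maximiser of x^2 exp(-2 theta x) on [0, x_max]: 1/theta, capped at x_max."""
    return x_max if theta <= 1.0 / x_max else 1.0 / theta

def efficiency(w, x, theta, x_max=10.0):
    """Eff(d, theta) = I(d, theta) / I(d_opt(theta), theta)."""
    x_opt = locally_optimal_point(theta, x_max)
    return fisher_info(w, x, theta) / fisher_info([1.0], [x_opt], theta)

design_unif = ((0.956, 0.022, 0.022), (0.38, 4.04, 10.0))
design_func = ((0.19, 0.30, 0.51), (0.54, 2.35, 10.0))

for theta in (0.05, 0.5, 1.0, 3.0):  # illustrative parameter values
    e_unif = efficiency(*design_unif, theta)
    e_func = efficiency(*design_func, theta)
    print(f"theta={theta:4.2f}  uniform-prior design: {e_unif:.3f}  "
          f"functional-uniform design: {e_func:.3f}")
```

The ratio form is equivalent to the exponential of the difference of log-informations used in the text.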
4 Conclusions



A main motivation for this work is the practical limitation of the classical Jeffreys prior that it cannot be used in
nonlinear regression settings, where the prior needs to be specified before data collection, for example when one wants
to calculate a Bayesian optimal design or in adaptive dose-finding trials. For this purpose the functional uniform
distribution has been introduced, which imposes a distribution on the parameters, so that it is uniform in the
functional shapes underlying the nonlinear regression function. This was achieved by using a general framework for
constructing uniform distributions based on earlier work by Dembski (1990) and Ghosal et al. (1997). We investigated
the functional uniform prior for nonlinear regression in a real example, a simulation study and an optimal design
problem, where it showed very satisfactory performance.

There is no reason to call the priors proposed in this article globally uninformative, because one needs to choose
the space $M$ and in particular the metric $d$ on which to impose the uniform distribution. The priors derived from the
theory in Section 2.1 might then be considered uninformative in the particular aspect that $(M, d)$ reflects. In the
case of nonlinear regression we argue that the uniform distribution on the space of functional shapes is often,
depending of course on the considered application, a reasonable assumption when particular prior information is
lacking. However, this does not apply generally: a situation where the functional uniform prior might not be adequate
occurs, for example, when the considered nonlinear model is extremely flexible, containing virtually all continuous
functions (as neural network models do). In this case it is often more adequate to concentrate most prior probability
on a reasonable subset of the function space (e.g., smooth functions), rather than building a uniform distribution on
all potential shapes, including shapes that might be implausible a priori.
The theory outlined in Section 2 might be of interest to formulate functional uniform priors also for other types
of models with a nonlinear aspect. In quite a few modelling situations one might be able to find a space $(M, d)$
where imposing a uniform distribution is plausible, and then back-transform this distribution to the parameter scale.
Ghosal et al. (1997) employ this idea when $(M, d)$ is a space of densities and define priors for nonparametric density
estimation. Another application could be the estimation of covariance matrices: Dryden et al. (2009) discuss the
use of more adequate non-Euclidean distance metrics for covariance matrices, which would in our framework define
the metric space for imposing the uniform distribution. Paulo (2005) derives default priors for Gaussian process
interpolation, which are rather time consuming to evaluate. In this situation one might choose the space of the
covariance functions as $(M, d)$.



A Proof of Theorem 1 



Ghosal et al. (1997) prove a closely related result, when the underlying metric is the Hellinger distance and $M$ the
space of residual densities. We review their proof and adapt it to metrics of the form (1), proceeding in two parts.
Part A summarizes the proof of Ghosal et al. (1997) for completeness and Part B provides additional lemmas needed in
our situation.
Part A 

The proof starts by covering $\Theta$ with hypercubes and inner hypercubes placed inside these cubes. Let
$A_1, \ldots, A_J$ be the intersections of $A$ with the hypercubes and $A'_1, \ldots, A'_J$ be the intersections of $A$
with the inner hypercubes. Now separate the hypercubes and inner hypercubes so that each inner hypercube is at least
$\epsilon$ apart from any other in the $d(\cdot, \cdot)$ metric (this is possible when the results proved in Lemma 1
hold). By the sub-additivity of packing numbers one then has
$\sum_j D(\epsilon, A'_j, d) \le D(\epsilon, A, d) \le \sum_j D(\epsilon, A_j, d)$ and
$\sum_j D(\epsilon, \Theta'_j, d) \le D(\epsilon, \Theta, d) \le \sum_j D(\epsilon, \Theta_j, d)$, where
$\Theta_j$ and $\Theta'_j$ denote the intersections of $\Theta$ with the hypercubes and inner hypercubes.

Now an upper and a lower bound for $D(\epsilon, A_j, d)$ are derived based on the local Euclidean approximation (1) to
the metric $d$. For a Euclidean metric one can calculate the packing number explicitly, see Kolmogorov and Tihomirov
(1961). Up to proportionality $D(\epsilon, A, \|\cdot\|)$ is given by $\mathrm{vol}(A)\epsilon^{-p}$; consequently,
for a metric of the form $\sqrt{(\theta - \theta')^T V (\theta - \theta')}$ with $V$ a fixed positive definite matrix,
the packing number is up to proportionality $\sqrt{\det(V)}\,\mathrm{vol}(A)\epsilon^{-p}$. Using the local Euclidean
approximation (1) and Lemma 2 one can derive lower and upper bounds for $D(\epsilon, A_j, d)$ and
$D(\epsilon, A'_j, d)$ in terms of $\sqrt{\det(V(\theta_j))}\,\mathrm{vol}(A_j)\epsilon^{-p}$ and
$\sqrt{\det(V(\theta_j))}\,\mathrm{vol}(A'_j)\epsilon^{-p}$, and similarly for $D(\epsilon, \Theta, d)$ and thus
for $P_\epsilon(A) = D(\epsilon, A, d)/D(\epsilon, \Theta, d)$. As the size of the hypercubes goes to zero the bounds
become sharper (see Lemma 2) and the lower and upper bound of $P_\epsilon(A)$ converge to
$\int_A \sqrt{\det(V(\theta))}\,d\theta \,/ \int_\Theta \sqrt{\det(V(\theta))}\,d\theta$, see Ghosal et al. (1997) for
the details of this argument.
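The factor $\sqrt{\det(V)}$ in the packing number for a fixed $V$ can be motivated by a linear change of variables. The following is only a short sketch of that step; the auxiliary variable $\eta$ and the symmetric square root $V^{1/2}$ are notation introduced here for illustration.

```latex
\begin{align*}
\eta = V^{1/2}\theta \quad &\Rightarrow \quad
\sqrt{(\theta-\theta')^T V (\theta-\theta')} = \|\eta-\eta'\|, \\
D\big(\epsilon, A, \sqrt{(\cdot)^T V (\cdot)}\big)
&= D\big(\epsilon, V^{1/2}A, \|\cdot\|\big)
\propto \mathrm{vol}\big(V^{1/2}A\big)\,\epsilon^{-p}
= \sqrt{\det(V)}\;\mathrm{vol}(A)\,\epsilon^{-p}.
\end{align*}
```

The last equality uses that a linear map with matrix $V^{1/2}$ scales volumes by $\det(V^{1/2}) = \sqrt{\det(V)}$.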
Part B 

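Without loss of generality we focus on setting $c_1 = c_2 = 1$ in (1) for what follows.

Lemma 1 For symmetric, positive definite $V(\theta)$ there exist $l^*, u^* > 0$ so that for $\theta, \theta' \in \Theta$

$$l^* \|\theta - \theta'\| \le d(\theta, \theta') \le u^* \|\theta - \theta'\|.$$

Proof:

$$\frac{d(\theta, \theta')^2}{\|\theta - \theta'\|^2} = \frac{(\theta - \theta')^T V(\theta')(\theta - \theta') + o(\|\theta - \theta'\|^2)}{\|\theta - \theta'\|^2}$$

Now by an eigendecomposition and the compactness of $\Theta$ and continuity of $V(\theta)$ one knows that there exist
$l, u > 0$ so that

$$l (\theta - \theta')^T(\theta - \theta') \le (\theta - \theta')^T V(\theta')(\theta - \theta') \le u (\theta - \theta')^T(\theta - \theta').$$

So in total we get an upper bound by $u^{*2} = u + k$, where
$k = \max(\max_{\theta, \theta' \in \Theta} o(\|\theta - \theta'\|^2)/\|\theta - \theta'\|^2, 0)$, and
the lower bound is obtained similarly. □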
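The eigendecomposition step can be checked numerically for a fixed matrix: the smallest and largest eigenvalues of a symmetric positive definite $V$ serve as $l$ and $u$. The sketch below uses a randomly generated matrix as a stand-in for $V(\theta')$; it illustrates the inequality only and is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3
B = rng.normal(size=(p, p))
V = B @ B.T + p * np.eye(p)          # symmetric positive definite by construction

eigvals = np.linalg.eigvalsh(V)      # eigenvalues in ascending order
l, u = eigvals[0], eigvals[-1]

diffs = rng.normal(size=(1000, p))   # stand-ins for theta - theta'
quad = np.einsum("ni,ij,nj->n", diffs, V, diffs)  # (theta-theta')^T V (theta-theta')
sq_norm = np.sum(diffs**2, axis=1)                # squared Euclidean norm

lower_ok = bool(np.all(l * sq_norm <= quad + 1e-9))
upper_ok = bool(np.all(quad <= u * sq_norm + 1e-9))
print(lower_ok, upper_ok)
```

In the lemma the same bounds have to hold uniformly over $\theta' \in \Theta$, which is where compactness of $\Theta$ and continuity of $V$ enter.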

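Lemma 2 For $\theta, \theta', \theta^*$ lying in a hypercube $Q \subset \Theta$ we obtain

$$k_1 (\theta - \theta')^T V(\theta^*)(\theta - \theta') \le d^2(\theta, \theta') \le k_2 (\theta - \theta')^T V(\theta^*)(\theta - \theta'),$$

where $k_1 \to c_0$ and $k_2 \to c_0$ for $c_0 > 0$, when the side length of $Q$ converges to 0.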
Proof:

$$\frac{d^2(\theta, \theta')}{(\theta - \theta')^T V(\theta^*)(\theta - \theta')} = \frac{(\theta - \theta')^T V(\theta')(\theta - \theta')}{(\theta - \theta')^T V(\theta^*)(\theta - \theta')} + \frac{o(\|\theta - \theta'\|^2)}{(\theta - \theta')^T V(\theta^*)(\theta - \theta')}$$

Now $V(\theta)$ is continuous, so one can lower and upper bound the second summand on $Q$. It converges to zero and
hence the bounds converge towards each other when the size of the hypercubes shrinks (and by this
$\theta^* \to \theta'$). Upper and lower bounding $o(\|\theta - \theta'\|^2)$ implies the desired result. □



B Proof of Theorem 2 

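Consider the distance metric $d^*(\gamma, \gamma_0) = d(h(\gamma), h(\gamma_0))$. To show invariance of the proposed
procedure, the uniform distribution derived from $d^*$ needs to be
$p(\gamma) \propto \det(H(\gamma)) \sqrt{\det(V(h(\gamma)))}$, which is the distribution derived from
$p(\theta) \propto \sqrt{\det(V(\theta))}$ using a change of variables.

A second order Taylor expansion of $d^2(h(\gamma), h(\gamma_0))$ in $\gamma_0$ leads to an approximation of the form
$(\gamma - \gamma_0)^T M(\gamma_0)(\gamma - \gamma_0)$, where the $(i,j)$ element of $M(\gamma)$ is given by

$$M(\gamma)_{(i,j)} = \sum_{k=1}^{p} \sum_{l=1}^{p} \frac{\partial^2 m(\theta)}{\partial \theta_k \partial \theta_l} \frac{\partial h_k(\gamma)}{\partial \gamma_i} \frac{\partial h_l(\gamma)}{\partial \gamma_j} + \sum_{k=1}^{p} \frac{\partial m(\theta)}{\partial \theta_k} \frac{\partial^2 h_k(\gamma)}{\partial \gamma_i \partial \gamma_j},$$

where $m(\theta) = d^2(\theta, \theta_0)$ and $\theta = h(\gamma)$. When evaluating this expression in the expansion
point the second summand vanishes as the gradient is zero. Hence one obtains
$M(\gamma) = H(\gamma)^T V(h(\gamma)) H(\gamma)$, which results in the density

$$p(\gamma) \propto \sqrt{\det(H(\gamma)^T V(h(\gamma)) H(\gamma))} = \det(H(\gamma)) \sqrt{\det(V(h(\gamma)))}. \quad □$$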
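The determinant identity in the last step can be checked numerically. The sketch below uses randomly generated matrices as stand-ins for $V(h(\gamma))$ and for the Jacobian $H(\gamma)$; these are assumptions for illustration and not quantities from any particular model. The check uses the absolute value of $\det(H)$, since the determinant of a general Jacobian may be negative while the density must be non-negative.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3

B = rng.normal(size=(p, p))
V = B @ B.T + np.eye(p)                      # stand-in for V(h(gamma)), positive definite
H = rng.normal(size=(p, p)) + p * np.eye(p)  # stand-in for the Jacobian H(gamma)

# Density factor in the gamma parametrization, computed in two ways.
lhs = np.sqrt(np.linalg.det(H.T @ V @ H))
rhs = abs(np.linalg.det(H)) * np.sqrt(np.linalg.det(V))
print(lhs, rhs)
```

Both numbers agree up to floating point error, because $\det(H^T V H) = \det(H)^2 \det(V)$.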